Search Result

Sentiment classification of incomplete data based on bidirectional encoder representations from transformers

LUO Jun, CHEN Lifei

Journal of Computer Applications 2021, 41 (1): 139-144. DOI: 10.11772/j.issn.1001-9081.2020061066

Incomplete data, such as interactive information on social platforms and review content in Internet movie datasets, is widespread in real life. However, most existing sentiment classification models are built on complete data and do not consider the impact of incomplete data on classification performance. To address this problem, a stacked denoising neural network model based on BERT (Bidirectional Encoder Representations from Transformers) was proposed for sentiment classification of incomplete data. The model consists of two components: a Stacked Denoising AutoEncoder (SDAE) and BERT. First, the incomplete data, after word embedding, was fed to the SDAE for denoising training, which extracts deep features to reconstruct the feature representations of missing and wrong words. Then, the output was passed to the pre-trained BERT model to further refine the feature vector representations of the words. Experimental results on two commonly used sentiment datasets demonstrate that the proposed method improves the F1 measure and classification accuracy on incomplete data by about 6% and 5%, respectively, verifying the effectiveness of the proposed model.
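The abstract does not say how incomplete inputs arise or how they are represented. As a rough illustration only, the sketch below simulates incomplete text by randomly dropping some words (missing words) and replacing others with a placeholder (wrong words). The function name `corrupt`, the rates, and the `<unk>` token are assumptions made for this example, not details from the paper.

```python
import random

def corrupt(tokens, drop_rate=0.1, replace_rate=0.1, unk="<unk>", seed=0):
    """Return a noisy copy of tokens: some dropped, some replaced by unk."""
    rng = random.Random(seed)  # seeded so the corruption is reproducible
    noisy = []
    for tok in tokens:
        r = rng.random()
        if r < drop_rate:
            continue                  # simulate a missing word
        if r < drop_rate + replace_rate:
            noisy.append(unk)         # simulate a wrong word
        else:
            noisy.append(tok)         # keep the word unchanged
    return noisy

clean = "the plot was thin but the acting was great".split()
noisy = corrupt(clean, drop_rate=0.2, replace_rate=0.2, seed=1)
```

A denoising autoencoder would then be trained to map the embedded `noisy` sequence back toward the embedded `clean` one; that training step is outside this sketch.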

Moving object removal forgery detection algorithm in video frame

YIN Li, LIN Xinqi, CHEN Lifei

Journal of Computer Applications 2018, 38 (3): 879-883. DOI: 10.11772/j.issn.1001-9081.2017092198

To detect tampering with intra-frame objects in digital video, a tamper detection algorithm based on Principal Component Analysis (PCA) was proposed. Firstly, the difference frame obtained by subtracting the detected video frame from the reference frame was denoised by a sparse representation method, which reduced the interference of noise with subsequent feature extraction. Secondly, the denoised video frame was divided into non-overlapping blocks, and pixel features were extracted by PCA to construct the eigenvector space. Then, the k-means algorithm was used to classify the eigenvector space, and the classification result was expressed as a binary matrix. Finally, image morphological operations were applied to the binary image to obtain the final detection result. The experimental results show that the proposed algorithm achieves a precision of 91%, a recall of 100%, and an F1 value of 95.3%, which are to some extent better than those of the video forgery detection algorithm based on compressive sensing. They also show that, for video with a still background, the proposed algorithm can not only detect tampering with moving objects in the frame, but is also robust to lossy compressed video.
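The reported F1 value can be checked against the reported precision and recall, assuming the standard definition of F1 as the harmonic mean of the two. The function name `f1_score` is chosen for this example.

```python
def f1_score(precision, recall):
    """Harmonic mean of precision and recall (standard F1 definition)."""
    if precision + recall == 0:
        return 0.0
    return 2 * precision * recall / (precision + recall)

f1 = f1_score(0.91, 1.00)
print(f"{f1:.1%}")  # 95.3%
```

With a precision of 91% and a recall of 100%, the result matches the 95.3% stated in the abstract.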

Probability model-based algorithm for non-uniform data clustering

YANG Tianpeng, CHEN Lifei

Journal of Computer Applications 2018, 38 (10): 2844-2849. DOI: 10.11772/j.issn.1001-9081.2018020375

To address the "uniform effect" of the traditional K-means algorithm, a new probability model-based algorithm was proposed for non-uniform data clustering. Firstly, a Gaussian mixture distribution model was proposed to describe the clusters hidden within non-uniform data, allowing the datasets to contain clusters with different densities and sizes at the same time. Secondly, the objective optimization function for non-uniform data clustering was deduced based on the model, and an EM (Expectation Maximization)-type clustering algorithm was defined to optimize the objective function. Theoretical analysis shows that the new algorithm is able to perform soft subspace clustering on non-uniform data. Finally, experimental results on synthetic datasets and real datasets demonstrate that the accuracy of the proposed algorithm is increased by 5% to 50% compared with the existing K-means-type algorithms and under-sampling algorithms.
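To make the idea of an EM-type algorithm on a Gaussian mixture concrete, here is a minimal sketch of textbook EM for a two-component, one-dimensional mixture. It is not the paper's algorithm or objective function: the function names, the initialisation, and the synthetic data (one large, wide cluster and one small, tight cluster) are all assumptions made for illustration.

```python
import math
import random

def gaussian_pdf(x, mean, var):
    return math.exp(-(x - mean) ** 2 / (2 * var)) / math.sqrt(2 * math.pi * var)

def fit_gmm_1d(data, n_iter=50, min_var=1e-6):
    """Fit a two-component 1-D Gaussian mixture with plain EM."""
    n = len(data)
    overall_mean = sum(data) / n
    overall_var = sum((x - overall_mean) ** 2 for x in data) / n
    means = [min(data), max(data)]          # crude initialisation (assumption)
    variances = [overall_var, overall_var]
    weights = [0.5, 0.5]
    for _ in range(n_iter):
        # E-step: responsibility of each component for each point
        resp = []
        for x in data:
            p = [weights[k] * gaussian_pdf(x, means[k], variances[k]) for k in range(2)]
            total = sum(p)
            resp.append([pk / total for pk in p])
        # M-step: re-estimate mixing weights, means and variances
        for k in range(2):
            nk = sum(r[k] for r in resp)
            weights[k] = nk / n
            means[k] = sum(r[k] * x for r, x in zip(resp, data)) / nk
            var = sum(r[k] * (x - means[k]) ** 2 for r, x in zip(resp, data)) / nk
            variances[k] = max(min_var, var)  # floor to avoid a collapsed component
    return weights, means, variances

# Non-uniform synthetic data: 200 points around 0, 20 points around 10.
rng = random.Random(0)
data = [rng.gauss(0.0, 1.0) for _ in range(200)] + [rng.gauss(10.0, 0.5) for _ in range(20)]
weights, means, variances = fit_gmm_1d(data)
```

Because each component has its own mixing weight and variance, the small cluster is not forced to be the same size as the large one, which is the property the abstract contrasts with the uniform effect of K-means.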

Classification of symbolic sequences with multi-order Markov model

CHENG Lingfang, GUO Gongde, CHEN Lifei

Journal of Computer Applications 2017, 37 (7): 1977-1982. DOI: 10.11772/j.issn.1001-9081.2017.07.1977

To solve the problem that existing methods based on fixed-order Markov models cannot make full use of the structural features contained in subsequences of different orders, a new Bayesian method based on the multi-order Markov model was proposed for symbolic sequence classification. First, a Conditional Probability Distribution (CPD) model was built based on the multi-order Markov model. Second, a suffix tree for n-order subsequences with efficient suffix-tables and its efficient construction algorithm were proposed, where the algorithm could be used to learn the multi-order CPD models by scanning the sequence set once. A Bayesian classifier was finally proposed for the classification task. The training algorithm was designed to learn the order-weights for the models of different orders based on the Maximum Likelihood (ML) method, while the classification algorithm was defined to carry out the Bayesian prediction using the weighted conditional probabilities of each order. A series of experiments were conducted on real-world sequence sets from three domains, and the results demonstrate that the new classifier is insensitive to changes in the predefined order of the model. Compared with existing methods such as the support vector machine using the fixed-order model, the proposed method achieves more than 40% improvement in classification accuracy on both gene sequences and speech sequences, and yields reference values for the optimal order of a Markov model on symbolic sequences.
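The idea of predicting with weighted conditional probabilities of several orders can be sketched as follows. This is a simplified illustration, not the paper's method: counts are stored in plain dictionaries instead of a suffix tree, the order-weights are fixed and uniform instead of being learned by maximum likelihood, class priors are assumed equal, and Laplace smoothing is added so unseen contexts do not give zero probability. All function names and the toy sequences are hypothetical.

```python
import math
from collections import defaultdict

def train_counts(sequences, max_order):
    """counts[k][context][symbol] = occurrences of symbol after a length-k context."""
    counts = [defaultdict(lambda: defaultdict(int)) for _ in range(max_order + 1)]
    for seq in sequences:
        for i, sym in enumerate(seq):
            for k in range(min(i, max_order) + 1):
                counts[k][seq[i - k:i]][sym] += 1
    return counts

def cond_prob(counts, k, context, sym, alphabet_size):
    """Laplace-smoothed P(sym | context) under the order-k model."""
    followers = counts[k].get(context, {})
    total = sum(followers.values())
    return (followers.get(sym, 0) + 1) / (total + alphabet_size)

def log_likelihood(seq, counts, order_weights, alphabet_size):
    """Score seq with a weighted mixture of the order-0..K conditional probabilities."""
    score = 0.0
    for i, sym in enumerate(seq):
        usable = range(min(i, len(order_weights) - 1) + 1)  # orders with enough history
        norm = sum(order_weights[k] for k in usable)
        mix = sum(order_weights[k] * cond_prob(counts, k, seq[i - k:i], sym, alphabet_size)
                  for k in usable)
        score += math.log(mix / norm)
    return score

def classify(seq, models, order_weights, alphabet_size):
    """Pick the class whose model gives seq the highest log-likelihood (equal priors)."""
    return max(models, key=lambda label: log_likelihood(seq, models[label],
                                                        order_weights, alphabet_size))

models = {
    "A": train_counts(["abababab", "babababa"], max_order=2),
    "B": train_counts(["aabbaabb", "bbaabbaa"], max_order=2),
}
uniform_weights = [1 / 3, 1 / 3, 1 / 3]  # assumption: the paper learns these instead
label_1 = classify("ababab", models, uniform_weights, alphabet_size=2)
label_2 = classify("aabbaa", models, uniform_weights, alphabet_size=2)
```

In this toy setting the two classes have identical symbol frequencies, so the order-0 model alone cannot separate them; the higher-order terms in the mixture do the work.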

Bayesian clustering algorithm for categorical data

ZHU Jie, CHEN Lifei

Journal of Computer Applications 2017, 37 (4): 1026-1031. DOI: 10.11772/j.issn.1001-9081.2017.04.1026

To address the difficulty of defining a meaningful distance measure for categorical data clustering, a new categorical data clustering algorithm was proposed based on Bayesian probability estimation. Firstly, a probability model with automatic attribute-weighting was proposed, in which each categorical attribute is assigned an individual weight to indicate its importance for clustering. Secondly, a clustering objective function was derived using maximum likelihood estimation and Bayesian transformation, and a partitioning algorithm was proposed to optimize the objective function; the algorithm groups data according to the weighted likelihood between objects and clusters instead of pairwise distances. Thirdly, an expression for estimating the attribute weights was derived, indicating that the weight should be inversely proportional to the entropy of the category distribution. Experiments were conducted on several real datasets and a synthetic dataset. The results show that the proposed algorithm yields higher clustering accuracy than the existing distance-based algorithms, achieving 5%-48% improvements on the bioinformatics data, with meaningful attribute-weighting results for the categorical attributes.
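The statement that an attribute's weight should be inversely proportional to the entropy of its category distribution can be illustrated with a small sketch. The exact weight expression from the paper is not given in the abstract, so the code below simply uses 1 / entropy, normalised to sum to 1, with a small `eps` to avoid division by zero; the function names and the toy cluster are assumptions.

```python
import math
from collections import Counter

def entropy(values):
    """Shannon entropy (natural log) of the empirical category distribution."""
    counts = Counter(values)
    n = len(values)
    return -sum((c / n) * math.log(c / n) for c in counts.values())

def attribute_weights(rows, eps=1e-9):
    """Weights proportional to 1 / entropy per attribute, normalised to sum to 1."""
    columns = list(zip(*rows))
    inverse = [1.0 / (entropy(col) + eps) for col in columns]
    total = sum(inverse)
    return [v / total for v in inverse]

# One hypothetical cluster with two categorical attributes.
cluster = [("red", "x"), ("red", "y"), ("red", "x"), ("blue", "y")]
weights = attribute_weights(cluster)
```

The first attribute is dominated by a single category (lower entropy), so it receives the larger weight; the second is split evenly and receives the smaller one.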

Relative importance index of dummy variables in regression model

LI Haichao, WANG Kaijun, HU Miao, CHEN Lifei

Journal of Computer Applications 2017, 37 (11): 3048-3052. DOI: 10.11772/j.issn.1001-9081.2017.11.3048

To describe qualitative attributes in a regression model, it is usually necessary to introduce dummy variables. For regression equations with dummy variables, a method was proposed to describe the differing importance of the individual dummy variables. The regression sum of squares of the equation with dummy variables was decomposed into the part due to the dummy variables and the part due to the non-dummy variables, the proportions of the two parts in the regression equation were calculated, and the proportion was taken as the index of relative importance of each dummy variable. Experiments were conducted on datasets from the Lending Club and Prosper networks, containing nearly 100 thousand lending records, to examine the influence of loan purpose on the borrowing success rate and of credit grade on the borrowing rate. The results show that, compared with the traditional regression equation, which only provides a dummy variable coefficient and cannot show its importance, the proposed method can show the importance of different dummy variables and provides an important means of quantitatively analyzing the degree of influence of qualitative independent variables on the dependent variable.
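For readers unfamiliar with dummy variables, the sketch below shows the usual encoding of a qualitative attribute with k categories as k-1 indicator columns, with one category held out as the reference so the columns are not perfectly collinear with the intercept. It covers only the encoding step, not the paper's importance index; the function name and the loan-purpose values are hypothetical.

```python
def dummy_encode(values, reference=None):
    """Encode a qualitative attribute as k-1 indicator (0/1) columns."""
    categories = sorted(set(values))
    if reference is None:
        reference = categories[0]      # default reference category (assumption)
    kept = [c for c in categories if c != reference]
    rows = [[1 if v == c else 0 for c in kept] for v in values]
    return kept, rows

# Hypothetical loan purposes; "car" becomes the reference category.
columns, rows = dummy_encode(["car", "education", "car", "medical"])
```

Each row of `rows` can then be appended to the numeric predictors of a regression; a row of all zeros denotes the reference category.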

Soft subspace clustering algorithm for imbalanced data

CHENG Lingfang, YANG Tianpeng, CHEN Lifei

Journal of Computer Applications 2017, 37 (10): 2952-2957. DOI: 10.11772/j.issn.1001-9081.2017.10.2952

To address the problem that current K-means-type soft-subspace algorithms cannot effectively cluster imbalanced data because of the uniform effect, a new partition-based algorithm was proposed for soft subspace clustering on imbalanced data. First, a bi-weighting method was proposed, where each attribute was assigned a feature-weight and each cluster was assigned a cluster-weight to measure its importance for clustering. Second, in order to make a trade-off between attributes of different types, or between categorical attributes with different numbers of categories, a new distance measurement was proposed for mixed-type data. Third, an objective function was defined for the subspace clustering algorithm on imbalanced data based on the bi-weighting method, and the expressions for optimizing both the cluster-weights and feature-weights were derived. A series of experiments were conducted on several real-world datasets, and the results demonstrate that the bi-weighting method used in the new algorithm can learn more accurate soft subspaces for the clusters hidden in imbalanced data. Compared with the existing K-means-type soft-subspace clustering algorithms, the proposed algorithm yields higher clustering accuracy on imbalanced data, achieving about 50% improvement on the bioinformatics data used in the experiments.

High-dimensional data clustering algorithm with subspace optimization

WU Tao, CHEN Lifei, GUO Gongde

Journal of Computer Applications 2014, 34 (8): 2279-2284. DOI: 10.11772/j.issn.1001-9081.2014.08.2279

A new soft subspace clustering algorithm was proposed to address the optimization of the projected subspaces, which is generally not considered in most existing soft subspace clustering algorithms. Maximizing the deviation of feature weights was proposed as the subspace optimization goal, and a quantitative formula was presented. On this basis, a new optimization objective function was designed which aimed at minimizing the within-cluster compactness while optimizing the soft subspace associated with each cluster. A new expression for feature-weight computation was mathematically derived, with which the new clustering algorithm was defined based on the framework of the classical k-means. The experimental results show that the proposed method significantly reduces the probability of becoming trapped in a local optimum prematurely and improves the stability of clustering results. It also achieves good performance and clustering efficiency, making it suitable for high-dimensional data cluster analysis.
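In soft subspace clustering built on the k-means framework, each cluster typically carries its own vector of feature weights, and points are assigned using a feature-weighted distance. The sketch below shows only that generic assignment step; the paper's objective function and weight-update expression are not given in the abstract and are not reproduced here. The function names, centers, and weights are made up for illustration.

```python
def weighted_sq_distance(point, center, weights):
    """Squared Euclidean distance with one weight per feature."""
    return sum(w * (p - c) ** 2 for p, c, w in zip(point, center, weights))

def assign(points, centers, cluster_weights):
    """Assign each point to the cluster with the smallest weighted distance."""
    labels = []
    for point in points:
        dists = [weighted_sq_distance(point, c, w)
                 for c, w in zip(centers, cluster_weights)]
        labels.append(dists.index(min(dists)))
    return labels

centers = [(0.0, 0.0), (5.0, 5.0)]
points = [(1.0, 5.0), (4.5, 5.2)]

# Cluster 0 lives mostly in the first feature; cluster 1 weighs both equally.
soft_labels = assign(points, centers, [(0.95, 0.05), (0.5, 0.5)])
# Plain k-means assignment for comparison: all weights equal to 1.
plain_labels = assign(points, centers, [(1.0, 1.0), (1.0, 1.0)])
```

The first point is close to cluster 0 in the feature that cluster 0 weighs heavily, so the weighted assignment places it there, whereas the unweighted distance sends it to cluster 1.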

Emotion classification with feature extraction based on part of speech tagging sequences in micro blog

LU Weisheng, GUO Gongde, CHEN Lifei

Journal of Computer Applications 2014, 34 (10): 2869-2873. DOI: 10.11772/j.issn.1001-9081.2014.10.2869

Traditional n-gram feature extraction tends to produce a high-dimensional feature vector. High-dimensional data not only increases the difficulty of classification, but also increases the classification time. To address this problem, this paper presented a feature extraction method based on Part-of-Speech (POS) tagging sequences. The principle of this method is to use POS sequences as text features to reduce the feature dimension, based on the property that POS sequences can represent a kind of text. In the experiments, compared with n-gram feature extraction, feature extraction based on POS sequences improved the classification accuracy by at least 9% and reduced the feature dimension by 4816. The experimental results show that the method is suitable for emotion classification in micro blogs.
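The dimensionality argument can be seen on a toy example: many different word n-grams collapse onto the same POS-tag n-gram. The sketch below counts both kinds of bigram over two hand-tagged posts. The posts, the tag names, and the function names are assumptions for illustration; the abstract does not specify the tagger, the tag set, or the sequence length used in the paper.

```python
from collections import Counter

def ngrams(items, n):
    """All contiguous n-grams of a list, as tuples."""
    return [tuple(items[i:i + n]) for i in range(len(items) - n + 1)]

def pos_sequence_features(tagged_posts, n=2):
    """Count POS-tag n-grams over posts given as lists of (word, tag) pairs."""
    counts = Counter()
    for post in tagged_posts:
        tags = [tag for _, tag in post]
        counts.update(ngrams(tags, n))
    return counts

def word_ngram_features(tagged_posts, n=2):
    """Count word n-grams over the same posts, for comparison."""
    counts = Counter()
    for post in tagged_posts:
        words = [word for word, _ in post]
        counts.update(ngrams(words, n))
    return counts

# Two hand-tagged posts (hypothetical data and tags).
posts = [
    [("I", "PRON"), ("love", "VERB"), ("this", "DET"), ("movie", "NOUN")],
    [("We", "PRON"), ("hate", "VERB"), ("that", "DET"), ("song", "NOUN")],
]
pos_feats = pos_sequence_features(posts)
word_feats = word_ngram_features(posts)
```

Here the two posts share no word bigram, yet they share every POS bigram, so the POS-based feature space is half the size of the word-based one even on this tiny sample.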
